2015-07-27 Creative Commons Attribution License

Topics

So far, we have focused on regression involving linear coefficients for numeric independent variables. Today, we turn our attention to new classes of regression terms:

  • "dummy" variables
  • categorical variables
  • interactions

Goals

After class today you will be able to

  • illustrate regression with dichotomous dummy variables in words, pictures, and equations
  • illustrate regression with polytomous dummy variables in words, pictures, and equations
  • illustrate regressions with interactions between a continuous variable and a dummy variable in words, pictures, and equations
  • identify substantive research questions that involve dummy variables and interactions
  • match theoretical models to interaction models
  • explain why it is important to include constitutive terms in a model with interactions
  • interpret models with interactions between dichotomous and continuous variables

dummy variables

We are going to be looking at the relationship between years of education and vocabulary. These data come originally from the GSS and were obtained from John Fox's website.

library(dplyr)    # tbl_df(), mutate(), filter()
library(ggplot2)  # plotting
library(ggthemes) # theme_economist_white()
library(broom)    # tidy()

data <- read.table("data/Vocabulary.txt", header=TRUE)
data <- tbl_df(data)
p <- ggplot(data, aes(x=education, y=vocabulary))
p + geom_jitter() + stat_smooth(method = "lm") + theme_bw()

p <- ggplot(data, aes(x=education))
p + geom_histogram(binwidth=1, colour="black", fill="dark grey", origin = -0.5) +
  scale_x_continuous(breaks=0:20) + theme_bw() + facet_grid(sex ~ .) 

p <- ggplot(data, aes(x=vocabulary))
p + geom_histogram(binwidth=1, colour="black", fill="white", origin = -0.5) +
  scale_x_continuous(breaks=0:10) + facet_grid(sex ~ .)  + 
  theme_economist_white()

p <- ggplot(data, aes(x=education, y=vocabulary, colour=sex))
p + geom_jitter() + scale_colour_brewer(palette=2,type='qual') + 
  theme_bw() + theme(legend.position = c(.9,.1)) 

fit <- lm(vocabulary ~ education + sex, data = data)
tidy(fit)
##          term   estimate   std.error statistic       p.value
## 1 (Intercept)  1.5080568 0.055641335 27.103174 4.086269e-159
## 2   education  0.3578652 0.004193481 85.338458  0.000000e+00
## 3     sexMale -0.2106902 0.025748042 -8.182766  2.925683e-16

What does it look like to draw this regression line?

(let's draw…)

p + geom_jitter(alpha = .1) + 
  scale_colour_brewer(palette=2,type='qual') + theme_bw() + 
  geom_smooth(aes(group=sex),method='lm',lty=2,size=2) +
  theme(legend.position = c(.9,.1))

What could make this a bit easier?

  • ALWAYS RECODE BINARY VARIABLES SO THAT THE VALUES ARE 0/1

  • ALWAYS MAKE SURE THE CODING SCHEME CARRIES MNEMONIC SIGNIFICANCE SO THAT 1=YES AND 0=NO

  • YES THIS WARRANTS ALL-CAPS!

What is going on in this code?

data <- mutate(data, female = as.numeric(sex=="Female"))
head(data)
## Source: local data frame [6 x 6]
## 
##         id  year    sex education vocabulary female
##      (int) (int) (fctr)     (int)      (int)  (dbl)
## 1 20040001  2004 Female         9          3      1
## 2 20040002  2004 Female        14          6      1
## 3 20040003  2004   Male        14          9      0
## 4 20040005  2004 Female        17          8      1
## 5 20040008  2004   Male        14          1      0
## 6 20040010  2004   Male        14          7      0
  1. The logical comparison sex == "Female" produces a TRUE/FALSE vector
  2. as.numeric() coerces TRUE/FALSE to 1/0
  3. using dplyr::mutate(), we make the variable female equal to 1 for females and 0 for males
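The coercion in step 2 can be checked directly at the console:

as.numeric(c("Female", "Male", "Female") == "Female")
## [1] 1 0 1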

Thought exercise:

  • Does recoding a dummy variable change a regression?

  • How or why?
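One way to answer this empirically (a quick sketch, reusing the data from above): recode the dummy the other way around and refit. The education slope and all fitted values are identical; only the intercept and the sign of the dummy coefficient change, because the reference group changes.

data <- mutate(data, male = 1 - female)
tidy(lm(vocabulary ~ education + male, data = data))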

How does our new model:

(fit.df <- tidy(lm(vocabulary ~ education + female, data = data)))
##          term  estimate   std.error statistic       p.value
## 1 (Intercept) 1.2973666 0.057841978 22.429499 3.656442e-110
## 2   education 0.3578652 0.004193481 85.338458  0.000000e+00
## 3      female 0.2106902 0.025748042  8.182766  2.925683e-16

compare to old model:

tidy(lm(vocabulary ~ education + sex, data = data))
##          term   estimate   std.error statistic       p.value
## 1 (Intercept)  1.5080568 0.055641335 27.103174 4.086269e-159
## 2   education  0.3578652 0.004193481 85.338458  0.000000e+00
## 3     sexMale -0.2106902 0.025748042 -8.182766  2.925683e-16

Dummy = Different Intercept

  • The concept to keep in mind is that fitting a dichotomous variable has the effect of fitting two different intercepts, one for each group

  • What we are testing is whether fitting a separate intercept adds sufficient explanatory power to warrant the additional complication

fit.df
##          term  estimate   std.error statistic       p.value
## 1 (Intercept) 1.2973666 0.057841978 22.429499 3.656442e-110
## 2   education 0.3578652 0.004193481 85.338458  0.000000e+00
## 3      female 0.2106902 0.025748042  8.182766  2.925683e-16

Is the separate intercept for females warranted here? Why or why not?

categorical variables

We are going to be looking at the relationship between average years of education, average income, occupation category, and occupational prestige. These data are from the 1971 Canadian Census and were obtained from John Fox's website.

data <- read.table("data/Prestige.txt", header=TRUE)
head(data)
##                     education income women prestige census type
## GOV.ADMINISTRATORS      13.11  12351 11.16     68.8   1113 prof
## GENERAL.MANAGERS        12.26  25879  4.02     69.1   1130 prof
## ACCOUNTANTS             12.77   9271 15.70     63.4   1171 prof
## PURCHASING.OFFICERS     11.42   8865  9.11     56.8   1175 prof
## CHEMISTS                14.62   8403 11.68     73.5   2111 prof
## PHYSICISTS              15.64  11030  5.13     77.6   2113 prof
data <- tbl_df(data)
head(data)
## Source: local data frame [6 x 6]
## 
##   education income women prestige census   type
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)
## 1     13.11  12351 11.16     68.8   1113   prof
## 2     12.26  25879  4.02     69.1   1130   prof
## 3     12.77   9271 15.70     63.4   1171   prof
## 4     11.42   8865  9.11     56.8   1175   prof
## 5     14.62   8403 11.68     73.5   2111   prof
## 6     15.64  11030  5.13     77.6   2113   prof

data <- read.table("data/Prestige.txt", header=TRUE)
data$occupation <- rownames(data)
rownames(data) <- NULL
data <- tbl_df(data)
data <- filter(data, !is.na(type))
head(data)
## Source: local data frame [6 x 7]
## 
##   education income women prestige census   type          occupation
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)               (chr)
## 1     13.11  12351 11.16     68.8   1113   prof  GOV.ADMINISTRATORS
## 2     12.26  25879  4.02     69.1   1130   prof    GENERAL.MANAGERS
## 3     12.77   9271 15.70     63.4   1171   prof         ACCOUNTANTS
## 4     11.42   8865  9.11     56.8   1175   prof PURCHASING.OFFICERS
## 5     14.62   8403 11.68     73.5   2111   prof            CHEMISTS
## 6     15.64  11030  5.13     77.6   2113   prof          PHYSICISTS

Prestige ~ Education

p <- ggplot(data, aes(x=education, y=prestige))
p + geom_jitter() + stat_smooth(method = "lm") + theme_bw()

Three types of jobs

p <- ggplot(data, aes(x=education, y=prestige))
p + geom_jitter() + theme_bw() + facet_grid(. ~ type) + stat_smooth(method = "lm")  

Interpreting coefficients

fit <- lm(prestige ~ education + type, data = data)
fit.df <- tidy(fit)
print(fit.df)
##          term  estimate std.error  statistic      p.value
## 1 (Intercept) -2.698159  5.736093 -0.4703828 6.391712e-01
## 2   education  4.572793  0.671564  6.8091697 9.159877e-10
## 3    typeprof  6.142444  4.258961  1.4422401 1.525583e-01
## 4      typewc -5.458495  2.690667 -2.0286769 4.532001e-02

Let's draw the graph of the relationship between occupational prestige, education, and job type implied by these results…

Check our results

Series of dummy variables

  • Any categorical variable can be represented by a series of dummy variables

  • Each group gets its own intercept

  • What are potential pitfalls or downsides to this (e.g., if \(k\) gets big)?
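Behind the scenes, lm() already builds such a series: a k-level factor is expanded into k − 1 dummy columns (here type, whose first rows are prof occupations):

head(model.matrix(~ type, data = data), 3)
##   (Intercept) typeprof typewc
## 1           1        1      0
## 2           1        1      0
## 3           1        1      0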

How often do we care about intercepts? What if we also want different slopes for different groups?

continuous * dummy variable interactions

Back to education and vocabulary

Adding an interaction term

data <- mutate(data, female = as.numeric(sex=="Female"))
fit <- lm(vocabulary ~ education + female + education:female, data = data)
fit.df <- tidy(fit)
print(fit.df)
##               term    estimate   std.error  statistic      p.value
## 1      (Intercept) 1.348928559 0.080013149 16.8588360 2.294345e-63
## 2        education 0.353897438 0.005973606 59.2435195 0.000000e+00
## 3           female 0.110384052 0.110587272  0.9981624 3.182118e-01
## 4 education:female 0.007823049 0.008387855  0.9326638 3.510040e-01

Sketch the graph of the relationship between vocabulary, education, and gender implied by this result.

Put simply, what did we do?

Interactions as separate slopes

  • When you interact a dummy variable with a numeric variable, you are fitting a separate slope for each group represented by the dummy variable

  • We can look at how the terms cancel out (since female equals 0 or 1) to accomplish this:

\[ \hat{y_i} = \beta_0 + \beta_1X_{1[i]} + \beta_2X_{2[i]} + \beta_3X_{1[i]}X_{2[i]} \]

coef(fit)
##      (Intercept)        education           female education:female 
##      1.348928559      0.353897438      0.110384052      0.007823049

The modeling question becomes, "does the explanatory power of having separate slopes outweigh the added complexity?"
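The plot below references inter.male, slope.male, inter.female, and slope.female; a minimal sketch of how to construct them from the coefficients printed above:

b <- coef(fit)  # fit <- lm(vocabulary ~ education + female + education:female, ...)
inter.male   <- b["(Intercept)"]                       # female = 0
slope.male   <- b["education"]
inter.female <- b["(Intercept)"] + b["female"]         # dummy shifts the intercept
slope.female <- b["education"] + b["education:female"] # interaction shifts the slope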

line.size <- 2
p <- ggplot(data, aes(x=education, y=vocabulary, colour=sex))
p + geom_jitter(alpha=0.7) + 
  scale_color_manual(values = c("dark green", "dark orange")) +
  geom_abline(intercept = inter.male, slope = slope.male, colour="dark orange", size=line.size) +
  geom_abline(intercept = inter.female, slope = slope.female, colour="dark green", size=line.size) +
  expand_limits(x = 0, y = 0)

Back to occupational prestige…

data <- read.table("data/Prestige.txt", header=TRUE)
data$occupation <- rownames(data)
rownames(data) <- NULL
data <- tbl_df(data)
data <- filter(data, !is.na(type))
head(data)
## Source: local data frame [6 x 7]
## 
##   education income women prestige census   type          occupation
##       (dbl)  (int) (dbl)    (dbl)  (int) (fctr)               (chr)
## 1     13.11  12351 11.16     68.8   1113   prof  GOV.ADMINISTRATORS
## 2     12.26  25879  4.02     69.1   1130   prof    GENERAL.MANAGERS
## 3     12.77   9271 15.70     63.4   1171   prof         ACCOUNTANTS
## 4     11.42   8865  9.11     56.8   1175   prof PURCHASING.OFFICERS
## 5     14.62   8403 11.68     73.5   2111   prof            CHEMISTS
## 6     15.64  11030  5.13     77.6   2113   prof          PHYSICISTS

data <- mutate(data, prof = as.numeric(type=="prof"))
data <- mutate(data, white.collar = as.numeric(type=="wc"))
fit <- lm(prestige ~ education + prof + white.collar + education:type, data = data)
fit.df <- tidy(fit)
print(fit.df)
##                 term   estimate std.error  statistic      p.value
## 1        (Intercept)  -4.293600  8.647020 -0.4965409 6.206972e-01
## 2          education   4.763651  1.024740  4.6486436 1.112981e-05
## 3               prof  18.863655 16.888126  1.1169774 2.669126e-01
## 4       white.collar -24.383279 21.777713 -1.1196437 2.657802e-01
## 5 education:typeprof  -0.980805  1.449480 -0.6766598 5.003196e-01
## 6   education:typewc   1.670938  2.077688  0.8042294 4.233378e-01

Let's sketch the graph showing the relationship between education and occupational prestige implied by this result…

How did we do?

Interactions and research questions

Relationship between education and income

suppressPackageStartupMessages(library(dplyr))
load("data/gss_2010_training.RData")
gss.training <- tbl_df(gss.training)
gss <- select(gss.training, income06_n, educ, maeduc, paeduc, sex) %>%
  filter(!is.na(income06_n), !is.na(educ), !is.na(maeduc), !is.na(paeduc), !is.na(sex))
# NOTE: DROPPING MISSING DATA LIKE THIS CAN BE DANGEROUS
gss <- rename(gss, income = income06_n)
gss <- mutate(gss, female = as.numeric(sex=="female"))

When to include?

Brambor et al. (2006) advise: include interaction terms whenever there is a conditional hypothesis

Hypothesis: The relationship between education and income is different for women and men.

or

Hypothesis: There is a relationship between education and income.

\[\widehat{income}_i = \beta_0 + \beta_1 educ_i + \beta_2 sex_i + \beta_3 sex_i * educ_i\]

What is "conditional"? The relationship between education and income is conditional on sex.

Model results

\[\widehat{income}_i = \beta_0 + \beta_1 educ_i + \beta_2 sex_i + \beta_3 sex_i * educ_i\]

fit <- lm(income ~ educ + female + educ:female, data = gss)
tidy(fit)
##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03

Constitutive Terms

  • Constitutive terms are the base terms that make up an interaction

  • If you are modeling \(y \sim x_1 + x_2 + x_1\times x_2\), \(x_1\) and \(x_2\) are constitutive terms

  • You might be tempted to just model \(y = x_1 \times x_2\) instead…

  • Include all constitutive terms except in very rare circumstances

Why?

Let's say this is the model we specify…

fit <- lm(income ~ educ + educ:female, data = gss)
fit.df <- tidy(fit)
fit.df
##          term    estimate  std.error statistic      p.value
## 1 (Intercept)  9.21718217 0.50822134 18.136157 1.753675e-68
## 2        educ  0.71015251 0.03616492 19.636503 4.698687e-79
## 3 educ:female -0.06556639 0.01485827 -4.412788 1.071287e-05

How do we interpret this? (perhaps draw a picture…)
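The picture below needs the implied intercept and slopes; a sketch of how to read them off this (mis-specified) fit, in which both sexes are forced through one intercept:

b <- coef(fit)  # fit <- lm(income ~ educ + educ:female, data = gss)
inter        <- b["(Intercept)"]   # shared: no constitutive female term
slope.male   <- b["educ"]
slope.female <- b["educ"] + b["educ:female"]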

p1 <- ggplot(gss, aes(x=educ, y=income, col=sex))
(p1 <- p1 + geom_jitter() + theme(legend.position=c(.8,.2))+
  scale_color_manual(values = c("blue", "dark green")) +
  geom_abline(intercept = inter, slope = slope.male, col="blue") + 
  geom_abline(intercept = inter, slope = slope.female, col="dark green"))

We don't want to require that men and women have the same predicted level of income at 0 years of schooling!

When you specify a model, think carefully about what you are saying!

Correct specification
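Here the intercepts and slopes come from the full model with both constitutive terms (a sketch):

fit <- lm(income ~ educ + female + educ:female, data = gss)
b <- coef(fit)
inter.male   <- b["(Intercept)"]
slope.male   <- b["educ"]
inter.female <- b["(Intercept)"] + b["female"]
slope.female <- b["educ"] + b["educ:female"]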

p2 <- ggplot(gss, aes(x=educ, y=income, col=sex))
(p2 <- p2 + geom_jitter() + theme(legend.position=c(.8,.2))+
  scale_color_manual(values = c("blue", "dark green")) +
  geom_abline(intercept = inter.male, slope = slope.male, col="blue") + 
  geom_abline(intercept = inter.female, slope = slope.female, col="dark green"))

Interactions and interpretation

All comes back to meaning…

fit <- lm(income ~ educ + female + educ:female, data = gss)
tidy(fit)
##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03

If sex matters, then it would be strange to say that people with one additional year of education are predicted to have 0.6 higher income (sorry that income is not measured in dollars).

The point of the model is that the difference in predicted value depends on whether the respondent is a man or a woman.

Giving a marginal effect

##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 11.1928734 0.72610155 15.415025 6.483265e-51
## 2        educ  0.5761664 0.05043841 11.423167 2.243893e-29
## 3      female -3.8500659 1.01361238 -3.798361 1.497277e-04
## 4 educ:female  0.1961346 0.07047251  2.783135 5.431051e-03

The effect of education should be interpreted as:

0.576 + 0.196 * Female

It can be helpful to think about it like taking a partial derivative:

\[Income = \beta_0 + \beta_1 Educ + \beta_2 Female + \beta_3 Educ Female\]

\[\frac{dIncome}{dEduc} = \beta_1 + \beta_3 * Female\]
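Plugging the estimates from above into this derivative:

b <- coef(fit)  # fit <- lm(income ~ educ + female + educ:female, data = gss)
b["educ"]                    # marginal effect for men (female = 0): about 0.576
b["educ"] + b["educ:female"] # marginal effect for women (female = 1): about 0.772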

Meaning of marginal effect

fit <- lm(income ~ educ + maeduc + educ:maeduc, data = gss)
tidy(fit)
##          term    estimate   std.error  statistic      p.value
## 1 (Intercept)  8.44645152 1.092420927  7.7318654 1.620658e-14
## 2        educ  0.65796442 0.084565682  7.7805134 1.116219e-14
## 3      maeduc  0.12757456 0.102105208  1.2494423 2.116404e-01
## 4 educ:maeduc -0.00288615 0.007258511 -0.3976228 6.909480e-01

More importantly, what if the interaction is with a numeric variable?

For this model, the educ coefficient is the predicted effect of 1 year of education when mother's education (maeduc) is 0!

"One cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term." - Brambor et al. 2006, p. 74

  • In most tables, we only see the effect of a variable when the variables it interacts with equal 0

  • We will return to this in our discussion section…

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \beta_{21} Asian Indian * female_i + . . . + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

How many dummies needed?

Why are there 18 race/ethnicity dummy variables and 19 race/ethnicity groups studied?

  • omitted group (whites) is measured by intercept \(\beta_0\)

Model for white man:

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

Model for white woman:

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

Model for Chinese-American man:

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

Model for Chinese-American woman:

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

(a little difficult to make sense of…)

  • "The effect of sex is always weaker among minorities than whites"
  • predicted incomes for women in 18 non-white race/ethnicity groups are higher in a model that includes an interaction term than in a model that does not include an interaction term

How would this be visible in residuals of model with no interaction?

\[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

Hint: Here's the full model \[\begin{eqnarray} log(inc_{i}) &=& \beta_0 + \nonumber \\ & & \beta_1 Chinese_i + \beta_2 Asian Indian + . . . + \beta_{18} Native American + \nonumber \\ & & \beta_{19} female_i + \nonumber \\ & & \beta_{20} Chinese_i * female_i + \beta_{21} Asian Indian * female_i + . . . + \nonumber \\ & & \epsilon_i \end{eqnarray}\]

How much does this increase your confidence in this result?

How must their code be organized in order to be able to run these checks reliably?

Image by Roger Peng

wrap-up

Questions?

Goal check

Motivation for next class

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] broom_0.3.7    ggthemes_2.2.1 ggplot2_1.0.1  dplyr_0.4.3   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1        knitr_1.11         magrittr_1.5      
##  [4] MASS_7.3-44        mnormt_1.5-3       munsell_0.4.2     
##  [7] colorspace_1.2-6   R6_2.1.1           stringr_1.0.0     
## [10] plyr_1.8.3         tools_3.2.2        parallel_3.2.2    
## [13] grid_3.2.2         gtable_0.1.2       psych_1.5.8       
## [16] DBI_0.3.1          htmltools_0.2.6    lazyeval_0.1.10   
## [19] yaml_2.1.13        assertthat_0.1     digest_0.6.8      
## [22] RColorBrewer_1.1-2 tidyr_0.3.1        reshape2_1.4.1    
## [25] formatR_1.2.1      evaluate_0.8       rmarkdown_0.8     
## [28] labeling_0.3       stringi_0.5-5      scales_0.3.0      
## [31] proto_0.3-10